High-throughput whole transcriptome RNA sequencing (RNA-seq) technology has enabled a new era of biomedical and clinical research by providing unprecedented high-resolution view of the global transcriptional landscape(Kukurba and Montgomery 2015). For instance, the technology is used routinely to obtain the most comprehensive genome-wide view of the transcriptome, study differential expression patterns [cite many] and allele-specific expression (ASE), analyze splicing and alternative splicing mechanism [cite]. Beyond surveying gene expression levels, RNA-seq can also be applied to investigate genome-wide transcriptional changes, such as chimeric gene fusions, single nucleotide variants, insertions, and deletion events (Z. Wang, Gerstein, and Snyder 2009). Moreover, RNA-seq enabled many novel assays that can be used to to invistigate chromotaine accessibility (e.g ATAC-seq)(Buenrostro et al. 2015), transcriptor factor binding preferences (e.g ChIP-seq) (Park 2009), RNA binding proteins binding sites (e.g CLIP-seq) (Hafner et al. 2021). The capabilities of RNA-seq technology makes it an increasingly adapted technology for the systematic characterization of whole transcriptomes in many organisms and diseases (Marioni et al. 2008).
The consistency and the accuracy of RNA-seq data are critical to ensure that the conclusions of these studies are not flawed. Multiple library preparation steps may introduce potential factors that can confound the experiments, including sample preparation, sequencing platforms, and bioinformatics analysis tools. Many of these factors were studied in depth, but less attention was given to different sample preparation, including RNA isolation, sample handling, library storage time, RNA input level, and sample cryopreservation (L. Wang et al. 2019).
In this project, we investigate the potential consequences of cDNA library storage time and sample cryopreservation on gene level characterization of RNA-seq experiments. Our hypothesis is that RNA-seq data is sensitive to sample preparation factors and that can affect the gene expression profiles of the experiments..
Another brief paragraph summarizing your key insights and possible future experiments/analyses that might enhance your own analysis.
The analysis of the possible effects of library storage time and sample cryopreservation shows that the two factors has minor to no effect on the RNA-seq experiments. Figure 1 and 2 shows PCA plot for studying the impact of library storage time and sample cryopreservation, respectively. We see that the samples from the same patient are closest despite the library storage time and sample cryopreservation. The same observation is seen when looking at the hierarchical clustering of the samples as shown in Figure 3 and 4. For future analysis, we may consider adding more samples to the experiment to get a higher confidence on the results. Also, for the library storage time, it would be better to get samples from different time points to point out the main features of the RNA-seq data that change with time. For sample cryopreservation study, we may consider compare different freezing technologies to make a more informed decision when choosing to store samples for an extended period of time. Other than these two factors that were studied here, we can also look at other library preparation steps like, RNA input concentration to obtain a full picture of possible confounding factors.
Figure 1: PCA plot for library storage time impact
Figure 2: PCA plot for cryopreservation impact
Figure 3: Dendrogram cluster for library storage time impact
Figure 4: Dendrogram cluster for cryopreservation impact
The data that will be used for this project are from GSE110417. the data contains 21 samples, that are either B cells or CD4+ cells isolated from freshly drawn peripheral blood. 13 samples used to investigate the library storage time. The samples are grouped into two: the first group contains samples from 6 patients sequenced with original cDNA library, and the second group contains samples for the same patients but prepared with a cDNA library stored at -80 degrees for three years, with an addition of a sample from a patient prepared with a newly constructed cDNA library. Finally, to investigate the impact of the sample cryopreservation on RNA-seq experiments, fresh and cryopreserved CD4+ samples derived from 4 patients were used, totaling to 8 samples. The sample considered fresh were derived from freshly isolated cells and stored as cell lysates in Qiazol, while the samples considered frozen, peripheral blood mononuclear cells (PBMCs) were isolated from fresh whole blood then frozen in Cosmic calf serum containing 5% DMSO at −80 degrees. The cDNA libraries were prepared using mRNA TruSeq v.2 (Illumina) and Illumina HiSeq 2000 instrument was used to sequence all the samples.
Samples <- read.csv('SraRunTable.txt')
more_details <- read.csv('samples_title.csv')
Samples <- merge(x=Samples, y=more_details, by='Sample_geo_accession')
Samples <- Samples[, c('Sample_title','Run','Cell_type', 'AvgSpotLen', 'Platform', 'LibraryLayout','Instrument', 'LibrarySelection', 'LibrarySource')]
kable(Samples) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")| Sample_title | Run | Cell_type | AvgSpotLen | Platform | LibraryLayout | Instrument | LibrarySelection | LibrarySource |
|---|---|---|---|---|---|---|---|---|
| concentration_1 rep1 | SRR6703961 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_100 rep1 | SRR6703962 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_250 rep1 | SRR6703963 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_500 rep1 | SRR6703964 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_1 rep2 | SRR6703965 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_100 rep 2 | SRR6703966 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_100 rep2 | SRR6703967 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| concentration_500 rep2 | SRR6703968 | B cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_101SR | SRR6703969 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_101FR | SRR6703970 | CD4+ T cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_103SR | SRR6703971 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_103FR | SRR6703972 | CD4+ T cell | 50 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_104SR | SRR6703973 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_104FR | SRR6703974 | CD4+ T cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_107SR | SRR6703975 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_107FR | SRR6703976 | CD4+ T cell | 100 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_107SRN | SRR6703977 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_109SR | SRR6703978 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_109FR | SRR6703979 | CD4+ T cell | 50 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_110SR | SRR6703980 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| time_110FR | SRR6703981 | CD4+ T cell | 50 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_501 | SRR6703982 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_501f | SRR6703983 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_506 | SRR6703984 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_506f | SRR6703985 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_507 | SRR6703986 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_507f | SRR6703987 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_510 | SRR6703988 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
| frozen_510f | SRR6703989 | CD4+ T cell | 102 | ILLUMINA | PAIRED | Illumina HiSeq 2000 | cDNA | TRANSCRIPTOMIC |
For my project the data I intend to use is stored in SRA repository. To download the data from SRA, I used sra-toolkit to fetch and extract the fastq files. First I obtained the SRR accession list for all the fastq files for the project. I obtained the list from SRA Run Selector using the GEO accession number: GSE110417 here.
The following command is used to download all the files:
prefetch --option-file SRR_Acc_List.txtThe downloaded files are not a proper .fastq files that can be processed, instead SRA-toolkit downloads .sra (Sequence Read Archive) files. The next step fastq-dump tool (there is a faster tool named fasterq-dump that was not found on the cluster) is use to extract the fastq files from these archive files. The following script is used to extract all the files:
#! /bin/bash -l
#SBATCH --partition=angsd_class #SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=SRATofastq
#SBATCH --time=24:00:00 # HH/MM/SS
#SBATCH --mem=8G # memory requested, units available: K,M,G,T
outFile=${DownloadSRA}_${SLURM_JOB_ID}.txt
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on node:" `hostname` >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}
spack load -r sra-toolkit
for file in *.sra; do
echo "File Name: " $file >> ${outFile}
fastq-dump --split-files --gzip $file
done
exitThe parameters that are used in the previous step are --split-files and --gzip. The first one is used to save the pair-end reads to two separate files, and the second is used to save the extracted fastq file to gzip compressed file. The documentation from the tool are displayed below:
--split-files Dump each read into separate file.Files
will receive suffix corresponding to read
number
--gzip Compress output using gzip First, I took a look at the FastQC analysis of one sample and found that there are were two sequences over-represented, so I decided to run Trimglore on all sequences. Figures 5-9 shows the results of FastQC analysis for all the library storage time samples reported using MultiQC. The samples for cryopreservation experiments shows the same trends as well. Figure 10 shows the length of trimmed sequences after running Trimglore.
Figure 5:
Figure 6:
Figure 7:
Figure 8:
Figure 9:
Figure 10:
Next step was to create the index files for our specie’s genome to align the reads. Since the samples for the project are from human, the human genome hg38 build was downloaded. For the annotation file we have different choices that we can use depending on the use case. Since we are not dealing with special annotation Ensemble gene annotation file hg38.ensGene.gtf.gz was used. The files were download from UCSC Browser. The following script was used to generate the index:
#! /bin/bash -l
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --job-name=IndexGenerator
#SBATCH --time=12:00:00 # HH/MM/SS
#SBATCH --mem=36G # memory requested, units available: K,M,G,T
outFile=IndexAlign_${SLURM_JOB_ID}.txt
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}
spack load star@2.7.0e
mkdir ${TMPDIR}/ilb4001
tmp=/scratchLocal/ilb4001
echo "Copying files to" ${tmp} >> ${outFile}
# Copy genome and annotation file
cp $SLURM_SUBMIT_DIR/hg38.fa ${SLURM_SUBMIT_DIR}/hg38.ensGene.gtf ${tmp}
# Copy sample read fastq files
cp $SLURM_SUBMIT_DIR/fastq_trimmed/SRR6703962_1_val_1.fq ${SLURM_SUBMIT_DIR}/fastq_trimmed/SRR6703962_2_val_2.fq ${tmp}
# Check that all the files were copied without an error
echo "tmp content..."
ls ${tmp} | cat >> ${outFile}
############################Indexing######################################
resultFolder=${tmp}/hg38_STARindex
mkdir ${resultFolder}
echo "STAR genomeGenerate...." >> ${outFile}
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir ${resultFolder} \
--genomeFastaFiles hg38.fa \
--sjdbGTFfile hg38.ensGene.gtf \
--sjdbOverhang 49 \
|& tee ${outFile}
echo "copying the generated index...." >> ${outFile}
cp -r ${resultFolder}/* ${SLURM_SUBMIT_DIR}/genome/hg38_STARindex/Apart from the input parameters, I set the --sjdbOverhang parameter to 49 since the length of the reads is 50 bp.
After generating the index for the reference genome, the next step was to align the reads. For that I used STAR alignment tool. Note: the files are pair-end reads. Figure 11 shows the percentages of different aligned reads. We can see that more than 95% of the reads were mapped, and a small percentage of the reads was not mapped. The script for aligning and generating the QC plots is in the full script at the end of the section.
Figure 11: Percentages of STAR aligned reads
To get the distribution of mapped read counts in the different genomic regions, I used the annotation bed file for housekeeping genes hg38.HouseKeepingGenes.bed. The reason for doing that is to speed up the process, as using the annotations for the whole genome was taking way too much time to process. The results of the analysis is discussed in the following sections.
Once the reads were aligned, I used featureCounts tool to generate the feature count for each sample in my dataset. The main parameters I set for the tool are:
-a: the gtf annotation file was set to hg38.ensGene.gtf-t: the feature level to use for reads counting was set to ‘exon’-p: Since we have pair-end reads, we count the fragments instead of reads. fragments are specified by the two paired-end readsThe script below was used to perform all the steps, except index generation.
#! /bin/bash -l
#SBATCH --partition=angsd_class
#SBATCH -w buddy.pbtech
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --job-name=ProcessAll
#SBATCH --time=48:00:00 # HH/MM/SS
#SBATCH --mem=36G # memory requested, units available: K,M,G,T
outFile=ProcessAll_${SLURM_JOB_ID}.txt
echo "SLURM_SUBMIT_DIR:" ${SLURM_SUBMIT_DIR} >> ${outFile}
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on node:" `hostname` >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}
spack load -r fastqc
spack load -r trimgalore
spack load star@2.7.0e
spack load samtools@1.9% gcc@6.3.0
spack load subread
spack load -r py-rseqc@2.6.4
mkdir /scratchLocal/ilb4001
tmp=/scratchLocal/ilb4001
fatqDir=${tmp}/raw_data
mkdir ${fatqDir}
echo "Copying fastq files to" ${fatqDir} >> ${outFile}
cp $SLURM_SUBMIT_DIR/raw_data/SRR67*.fastq.gz ${fatqDir}
mkdir ${tmp}/fastqc
FastQCDir=${tmp}/fastqc
echo "Start fastqc...." >> ${outFile}
fastqc $(ls ${fatqDir}/SRR67*.fastq.gz) --extract -o ${FastQCDir} |$ tee ${outFile}
mkdir ${tmp}/fastq_trimmed
TrimDir=${tmp}/fastq_trimmed
echo "Start TrimGalore" >> ${outFile}
trim_galore --gzip --illumina --fastqc --fastqc_args "--outdir ${FastQCDir}" --paired --stringency 5 -o ${TrimDir} $(ls ${fatqDir}/SRR67*.fastq.gz)
mkdir ${tmp}/genome
GenomeDir=${tmp}/genome
echo "Copying Genome files" >> ${outFile}
cp -r $SLURM_SUBMIT_DIR/genome/hg38_STARindex ${GenomeDir}
cp $SLURM_SUBMIT_DIR/genome/hg38.ensGene.gtf ${GenomeDir}
cp $SLURM_SUBMIT_DIR/genome/hg38.HouseKeepingGenes.bed ${GenomeDir}
AlignmDir=${tmp}/alignments
mkdir ${AlignmDir}
mkdir ${AlignmDir}/bamqc/
for sra in $(cat $SLURM_SUBMIT_DIR/raw_data/SRR_Acc_List.txt); do
sampleName=$(basename ${TrimDir}/${sra}*_1.fq.gz | cut -d '.' -f 1)
echo "Run STAR on " ${sampleName} >> ${outFile}
STAR --runMode alignReads \
--runThreadN 16 \
--readFilesIn ${TrimDir}/${sra}*_val_1.fq.gz ${TrimDir}/${sra}*_val_2.fq.gz \
--readFilesCommand zcat \
--genomeDir ${GenomeDir}/hg38_STARindex \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix ${AlignmDir}/${sampleName}.
echo "Run samtools index on " ${sampleName} >> ${outFile}
samtools index ${AlignmDir}/${sampleName}.Aligned.sortedByCoord.out.bam
echo "Run bamqc on " ${sampleName} >> ${outFile}
/softlib/apps/EL7/BamQC/bin/bamqc ${AlignmDir}/${sampleName}.Aligned.sortedByCoord.out.bam -o ${AlignmDir}/bamqc/
done
mkdir ${tmp}/RSeqc
QCDir=${tmp}/RSeqc
BED=${GenomeDir}/hg38.HouseKeepingGenes.bed
for sra in $(cat $SLURM_SUBMIT_DIR/raw_data/SRR_Acc_List.txt); do
# Create a directory to save the sample's results
sampleName=$(basename ${TrimDir}/${sra}*_1.fq.gz | cut -d '.' -f 1)
if [[ ! -e ${QCDir}/$sampleName ]]; then
mkdir ${QCDir}/$sampleName
fi
# Calculate the read distribution
read_distribution.py -i ${AlignmDir}/${sra}*.bam \
-r ${BED} > \
${QCDir}/$sampleName/rseqc_read_distribution.out
echo "read distribution End"
# Calculate the gene body coverage
geneBody_coverage.py -i ${AlignmDir}/${sra}*.bam \
-r ${BED} \
-o ${QCDir}/$sampleName/rseqc_geneBody_coverage.out
echo $sampleName "Done"
done
GTF=${GenomeDir}/hg38.ensGene.gtf
COUNTFILE=${tmp}/geneCounts.txt
# Run featureCounts
featureCounts -a ${GTF} \
-o ${COUNTFILE} \
-t 'exon' \
-p \
--verbose \
${AlignmDir}/*.bam
echo "coying the results back...." >> ${outFile}
cp -r ${AlignmDir}/* ${SLURM_SUBMIT_DIR}/alignments
mkdir ${SLURM_SUBMIT_DIR}/qc
cp -r ${FastQCDir}/* ${SLURM_SUBMIT_DIR}/qc
cp -r ${TrimDir} ${SLURM_SUBMIT_DIR}
cp -r ${QCDir} ${SLURM_SUBMIT_DIR}
cp ${tmp}/geneCounts* ${SLURM_SUBMIT_DIR}
echo "remove tmp directory" >> ${outFile}
rm -r ${tmp}
echo "Done!" >> ${outFile}
exitTo explore the effect of cDNA library storage time on RNA-seq output, I investigated the differences between two groups of CD4+ blood samples. The first group (orginal group) contains patient samples with original cDNA library, and the second group (3-year group) contains samples with the same cDNA library from the original group stored under -80^o for three years. In addition to that, one patient (ID:107) had a third sample was created by storing the RNA for three years, than used to construct the cDNA library. Figure 12 show an overview of the samples preparation and groups.
Figure 12: Sample Preparation Overview
First, I investigated the potential differences of sequencing biases between the two groups. For that I looked at the percentage of mapped reads across different genomic regions (i.e., intronic, exonic, intergenic and other regions) as shown in the Figure below. we can see that the distribution of the reads across genomic regions has little to no difference between the two groups.
read_dist[c('ID', "cond")]<- t(sapply(read_dist$donor, function(s) substring(s, c(1,4), c(3,6))))
pct_cols <- colnames(read_dist)[grepl("_pct", colnames(read_dist))]
df_read <- read_dist[(read_dist$cond == "FR") | (read_dist$cond == "SR") | (read_dist$cond == "SRN"), c("Sample","factor","donor", "ID", "cond",pct_cols )]
df_read <- gather(df_read, key="region", value="prct", X3_utr_exons_tag_pct:other_intergenic_tag_pct )
df_read$region <- gsub("_tag_pct", "", df_read$region)
old_regions <- unique(df_read$region)
new_regions <- c("3'UTR_Exons", "TES_down_5kb", "TSS_up_5kb","TES_down_1kb","CDS_exons", "5'UTR_exons","TSS_up_10kb","Introns","TSS_up_1kb","TES_down_10kb","Other_Intergenic")
mappings <- setNames(new_regions, old_regions)
args <- c(list(df_read$region), mappings)
df_read$region <- do.call(recode, args)
ggplot(df_read, aes(x=donor, y=prct, fill=region)) +
scale_fill_brewer( type='div', palette = "RdYlBu") +
geom_bar(position="stack", stat="identity") +
theme(plot.title = element_text(hjust = 0.5, face='bold'),
legend.position="bottom",
axis.text=element_text(size=9, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab("Sample") + ylab("Percentages") +labs(title="Read Distribution")ggsave("./figures/Read_distribution_time.png")We also looked at the gene body coverage to see if there are any pronounce differences. The figure below shows the gene body coverage between the two groups. We see no differences between the two groups.
data <- fromJSON(file = "../project/multiqc/multiqc_data/multiqc_data.json")
gene_body_prc <- data$report_plot_data$rseqc_gene_body_coverage_plot$datasets[[1]]
gene_body_prc <- lapply(gene_body_prc, function(x) x$data)
gene_body_prc <- lapply(gene_body_prc, function(x) sapply(x, function(s) s[2]))
df <- data.frame(gene_body_prc)
colnames(df) <- read_dist$donor
df$x <- seq(1:100)
df <- gather(df, key = "Sample", value = "gene_body_prc", -x)
df[c('ID', "cond")]<- t(sapply(df$Sample, function(s) substring(s, c(1,4), c(3,6))))
df_time <- df[(df$cond == "FR") | (df$cond == "SR") | (df$cond == "SRN"), ]
p1 <- ggplot(df_time, aes(x = x, y = gene_body_prc, group=Sample, color=cond)) +
geom_line()+
theme( plot.title = element_text(hjust = 0.5, face='bold'),
axis.text=element_text(size=9, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab("Gene Body Percentile (5' -> 3')") + ylab("%Coverage") +labs(title="Gene Body Coverage")+
scale_colour_brewer( type='div', palette = "Set1")
p1 #ggsave('Gene_Body_Coverage_time.png')In this section we look at the gene expression abundance and differentially expressed genes between the two groups. First, we used CPM normalized count to compare the gene counts between the samples of the two groups. The density distribution of gene expression values was highly similar between all the samples. The expression of the genes that are either protein coding or lincRNA was concordant between all the samples as shown in the figures below.
folder <- "./"
readcounts <- paste0(folder,"geneCounts_v2.txt") %>% read.table(., header=TRUE)
readcounts <- readcounts[, -c(7)]
keep_cols <- colnames(readcounts)[grepl("time", colnames(readcounts))]
readcounts <- readcounts[, c(names(readcounts)[1:6], keep_cols)]
sample_names <- unlist(lapply(strsplit(colnames(readcounts)[7:19], "[.]"), function(x) x[5]))
sample_names <- unlist(lapply(strsplit(sample_names, "_"), function(x) paste(x[2:(length(x)-3)], collapse="_")))
names(readcounts) <- c(names(readcounts)[1:6], sample_names)
row.names(readcounts) <- make.names(readcounts$Geneid)
readcounts <- readcounts[ , -c(1:6)]cpm <- data.frame(assay(DESeq.ds, "CPM") )
colnames(cpm) <- colnames(DESeq.ds)
cpm$GENEID <- rownames(cpm)
df_cpm <- cpm %>%
gather(colnames(cpm), key = "Sample", value = "CPM", -GENEID)
df_cpm[c('ID', "cond")]<- t(sapply(df_cpm$Sample, function(s) substring(s, c(1,4), c(3,6))))
ggplot(df_cpm, aes(x=CPM,group=Sample, color=cond )) +
geom_density() +
scale_color_brewer( type='div', palette = "Set1") +
geom_rug(sides="b") +
scale_x_log10()+
theme(plot.title = element_text(hjust = 0.5, face='bold'))+
labs(title='Density distribution of the gene CPM value of all samples', caption = "FR:original libraries; SR:3-year stored libraries")+
ylab('Density')#ggsave("Density_distribution_time.png")p1 <- ggplot(df_cpm, aes(y=log10(CPM), x=Sample, fill=cond )) +
geom_boxplot() +
scale_fill_brewer( type='div', palette = "Set1") +
theme(plot.title = element_text(hjust = 0.5, face='bold'),
axis.text=element_text(size=9, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(title="Distribution of the gene CPM value of all samples")
df_cpm2 <- merge(x= df_cpm, y= gene_details, by = "GENEID", all.x = TRUE)
p2 <- ggplot(df_cpm2[!is.na(df_cpm2$TXBIOTYPE), ], aes(y=log10(CPM), x=TXBIOTYPE, fill=Sample)) +
geom_boxplot() +
scale_fill_brewer( type='div', palette = "RdYlBu") +
theme(plot.title = element_text(hjust = 0.5, face='bold'),
axis.text=element_text(size=11, face='bold'),
axis.title=element_text(size=9,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab(label = "") + labs(title= "Comparison of the CPM values between\n lincRNAs and protein-coding genes ", caption = "FR:original libraries; SR:3-year stored libraries")
(p1 | p2) + plot_annotation(tag_levels = 'A')#ggsave('CPM_Boxplots.png')Furthermore, three RNA libraries of donor ID:107 (i.e., 107FR for the original library, 107SR for the 3-year library, and 107SRN for the new constructed library from the same RNA after three years), exhibited highly correlated expression patterns as shown in the following figure.
df <- DESeq.ds[, c("107SR","107FR")] %>% assay(., "CPM") %>% data.frame
colnames(df) <- c("107SR","107FR")
p1 <- ggplot(df, aes(x=`107SR`, y=`107FR`)) + geom_point(size=3)+
stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)
df <- DESeq.ds[, c("107SR","107SRN")] %>% assay(., "CPM") %>% data.frame
colnames(df) <- c("107SR","107SRN")
p2 <- ggplot(df, aes(x=`107SR`, y=`107SRN`)) + geom_point(size=3)+
stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)
df <- DESeq.ds[, c("107FR","107SRN")] %>% assay(., "CPM") %>% data.frame
colnames(df) <- c("107FR","107SRN")
p3 <- ggplot(df, aes(x=`107FR`, y=`107SRN`)) + geom_point(size=3)+
stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)
(p1 | p2 | p3) + plot_annotation(tag_levels = 'A')#ggsave("corr_107.png")Next, we looked at the PCA clustering of the samples to see if they cluster based on the condition (i.e., original library, stored library) or the patient/donor. The PCA plot suggests that the two RNA-seq samples from the same patient are closest despite the difference of their storage conditions. The same thing is shown by looking at the pearson correlation matrix between the samples and the hierarchical clustering as shown in the figures below. For the three plots I used rlog transformed counts since we are comparing the gene expression profiles between samples.
# For PCA and DE analysis the rlog normalized counts are used.
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
rv <- rowVars(assay(DESeq.rlog))
# Obtain the indecies of the top 500 variable genes
top_variable <- order(rv, decreasing = TRUE)[seq_len(500)]
# Compute the PCAs based on the rlog normalized gene expression
pca <- prcomp(t(assay(DESeq.rlog)[top_variable, ]))
# data.frame contains informtion for each sample. it will be used for PCA plot
data <- data.frame(colData(DESeq.ds))
# Rename columns
colnames(data) <- c("factor","donor", "ID", "Condition" )
# Plot the two top PCs
p1 <- autoplot(pca, data=data, colour = 'Condition', label = TRUE, label.size = 4, shape = FALSE) +
scale_colour_brewer( type='qual', palette = "Set1") +
theme_classic()+
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face='bold')) +
labs(title="PCA plot for Library Storage Time Imapct Samples", caption = "FR:original libraries; SR:3-year libraries")
#ggsave('PCA_time.png')
p1sample_df <- as.data.frame(colData(DESeq.ds)[c('ID','cond')])
colnames(sample_df) <- c('Donor ID', 'Condition')
ann_colors = list(Condition = c(FR = "#8DA0CB", SR = "#FC8D62", SRN="#66C2A5"))
corr_coeff <- cor(assay(DESeq.rlog), method = "pearson")
p1 <- as.dist(1-corr_coeff, upper = TRUE) %>%
as.matrix %>%
pheatmap::pheatmap(., main = "Pearson Distance", annotation_row = sample_df, annotation_colors = ann_colors)# Run pvclust
d.pv <- as.dendrogram(pvclust(corr_coeff, nboot=10, quiet=T)$hclust)
ddata <- dendro_data(d.pv, type = "rectangle")
# Get data frames to plot
df_seg <- segment(ddata)
df_labs <- data.frame(label(ddata), ID = as.factor(sample_df[match(label(ddata)$label, rownames(sample_df)), "Donor ID"]))
# Create ggplot dendrogram
p <- ggplot()+
geom_segment(data = df_seg,
aes(x = x, y = y, xend = xend, yend = yend),
size = 1.25,
colour = "darkgray", lineend = "round") +
geom_text(data = df_labs, aes(x = x, y = y, label = label, colour = ID),
nudge_y = 0,
family = "serif", size = 4, angle = 90, hjust = 1, fontface='bold') +
scale_colour_brewer( type='div', palette = "Dark2") +
xlab("") + ylab("Height")
p <- p + theme(axis.line.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
text = element_text(family = "serif")) +
# theme_gray()+
scale_y_continuous(expand = expand_scale(add = c(0.25, 0)))+
theme_classic() +
theme(axis.line.x = element_blank(), axis.ticks.x =element_blank(),
axis.text.x= element_blank(),legend.position = "none",
plot.title = element_text(hjust = 0.5, face='bold')) +
labs(title = "Cluster Dendrogram")
#ggsave('dendrogram_time.png')
pLastly, we looked at the differential gene expression between the original and 3-years group. We found 15 differentially expressed genes between groups. This suggest that cDNA library storage time has not introduced any biases in the samples that may effect the analysis when comparing the samples from different patient. A summary of the DE analysis is shown in the table below.
DE_ds <- DESeq.ds[, DESeq.ds$cond != "SRN" ]
colData(DE_ds)$cond <- factor(as.character(colData(DESeq.ds[, DESeq.ds$cond != "SRN" ])$cond), levels=c("FR", "SR"))
res <- DESeq(DE_ds)
res_df <- as.data.frame(results(res))
res_df$SYMBOL <- gene_symbol[match(rownames(res_df), names(gene_symbol))]
res_df <- res_df[(res_df$padj < 0.05) & (!is.na(res_df$padj)), ]
res_df <- res_df[order(res_df$padj),]
kable(res_df) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | SYMBOL | |
|---|---|---|---|---|---|---|---|
| ENSG00000213399 | 34.560165 | 5.219482 | 0.5226264 | 9.987022 | 0.00e+00 | 0.0000000 | AC022210.2 |
| ENSG00000229117 | 6795.090428 | 1.215919 | 0.1274798 | 9.538131 | 0.00e+00 | 0.0000000 | RPL41 |
| ENSG00000212664 | 35.855870 | -2.475455 | 0.3218213 | -7.692018 | 0.00e+00 | 0.0000000 | RP11-592N21.1 |
| ENSG00000273136 | 249.350865 | 1.871904 | 0.2539311 | 7.371701 | 0.00e+00 | 0.0000000 | NBPF26 |
| ENSG00000229994 | 18.202757 | 2.644336 | 0.3775303 | 7.004303 | 0.00e+00 | 0.0000000 | RPL5P4 |
| ENSG00000225573 | 10.533188 | 3.697417 | 0.6551633 | 5.643504 | 0.00e+00 | 0.0000574 | RPL35P5 |
| ENSG00000109471 | 37.953389 | -3.297028 | 0.6056449 | -5.443831 | 1.00e-07 | 0.0001541 | IL2 |
| ENSG00000111537 | 34.632150 | -1.348320 | 0.2538339 | -5.311820 | 1.00e-07 | 0.0002245 | IFNG |
| ENSG00000223551 | 24.775425 | 1.803860 | 0.3378907 | 5.338590 | 1.00e-07 | 0.0002245 | TMSB4XP4 |
| ENSG00000227008 | 44.714744 | 1.079535 | 0.2032282 | 5.311933 | 1.00e-07 | 0.0002245 | RP3-417G15.1 |
| ENSG00000218426 | 157.384940 | -1.008331 | 0.2224115 | -4.533628 | 5.80e-06 | 0.0109001 | RP11-475C16.1 |
| ENSG00000164400 | 5.651494 | -3.568857 | 0.8237855 | -4.332265 | 1.48e-05 | 0.0254335 | CSF2 |
| ENSG00000225131 | 9.828745 | 1.868007 | 0.4333528 | 4.310593 | 1.63e-05 | 0.0259005 | PSME2P2 |
| ENSG00000229204 | 4.675141 | -3.583913 | 0.8463938 | -4.234333 | 2.29e-05 | 0.0338607 | PTGES3P3 |
| ENSG00000238000 | 7.993616 | 2.175253 | 0.5169151 | 4.208144 | 2.57e-05 | 0.0354974 | RP11-274E7.2 |
Cryopreservation usually results in an increased proportion of damaged cells, which could potentially affect the outputs of RNA-seq experiments. To explore the effect of of sample cryopreservation on RNA-seq output, we investigated the differences between two groups of fresh (i.e. unfrozen group) and cryopreserved (i.e. frozen group) CD4+ blood samples derived from 4 healthy patients, where each subject has both fresh and cryopreserved sample. Figure 13 show an overview of the samples preparation and groups.
Figure 13: Sample Preparation Overview
Looking at the percentage of mapped reads across different genomic regions shows no effect of sample cryopreservation on the RNA-seq experiment.
df_read <- read_dist[(read_dist$cond == "f") | (read_dist$cond == ""), c("Sample","factor","donor", "ID", "cond",pct_cols )]
df_read <- gather(df_read, key="region", value="prct", X3_utr_exons_tag_pct:other_intergenic_tag_pct )
df_read$region <- gsub("_tag_pct", "", df_read$region)
old_regions <- unique(df_read$region)
new_regions <- c("3'UTR_Exons", "TES_down_5kb", "TSS_up_5kb","TES_down_1kb","CDS_exons", "5'UTR_exons","TSS_up_10kb","Introns","TSS_up_1kb","TES_down_10kb","Other_Intergenic")
mappings <- setNames(new_regions, old_regions)
args <- c(list(df_read$region), mappings)
df_read$region <- do.call(recode, args)
ggplot(df_read, aes(x=donor, y=prct, fill=region)) +
scale_fill_brewer( type='div', palette = "RdYlBu") +
geom_bar(position="stack", stat="identity") +
theme(plot.title = element_text(hjust = 0.5, face='bold'),
legend.position="bottom", axis.text=element_text(size=10, face='bold'),
axis.title=element_text(hjust=0.5, size=12,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab("Sample") + ylab("Percentages") +labs(title="Read Distribution")The gene body coverage shows a bias towards the 5’ end for fresh/unfrozen samples, while the frozen samples show a bias towards the 3’ end. But overall the gene body coverage follows the same trend across all the samples.
library("rjson")
data <- fromJSON(file = "../project/multiqc/multiqc_data/multiqc_data.json")
gene_body_prc <- data$report_plot_data$rseqc_gene_body_coverage_plot$datasets[[1]]
gene_body_prc <- lapply(gene_body_prc, function(x) x$data)
gene_body_prc <- lapply(gene_body_prc,function(x) sapply(x, function(s) s[2]))
df <- data.frame(gene_body_prc)
colnames(df) <- read_dist$donor
df$x <- seq(1:100)
df <- gather(df, key = "Sample", value = "gene_body_prc", -x)
df[c('ID', "cond")]<- t(sapply(df$Sample, function(s) substring(s, c(1,4), c(3,6))))
df_frozen <- df[(df$cond == "f") | (df$cond == "") , ]
df_frozen <- df_frozen %>%
mutate(cond=replace(cond, cond=="f", "Frozen")) %>%
mutate(cond=replace(cond, cond=="", "Unfrozen"))
p1 <- ggplot(df_frozen, aes(x = x, y = gene_body_prc, group=Sample, color=cond)) +
geom_line()+
theme( plot.title = element_text(hjust = 0.5, face='bold'),
axis.text=element_text(size=9, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab("Gene Body Percentile (5' -> 3')") + ylab("%Coverage") +labs(title="Gene Body Coverage")+
scale_colour_brewer( type='div', palette = "Set1")
p1The density distribution of gene expression values was highly concordant between all the samples. The expression of the genes that are either protein coding or lincRNA was similar between all the samples as shown in the figures below.
cpm <- data.frame(assay(DESeq.ds, "CPM") )
colnames(cpm) <- colnames(DESeq.ds)
cpm$GENEID <- rownames(cpm)
df_cpm <- cpm %>%
gather(colnames(cpm), key = "Sample", value = "CPM", -GENEID)
df_cpm[c('ID', "cond")]<- t(sapply(df_cpm$Sample, function(s) substring(s, c(1,4), c(3,6))))
df_cpm <- df_cpm %>%
mutate(cond=replace(cond, cond=="f", "Frozen")) %>%
mutate(cond=replace(cond, cond=="", "Unfrozen"))
ggplot(df_cpm, aes(x=CPM,group=Sample, color=cond )) +
geom_density() +
scale_color_brewer( type='div', palette = "Set1") +
geom_rug(sides="b") +
scale_x_log10()+
theme(plot.title = element_text(hjust = 0.5, face='bold'))+
labs(title='Density distribution of the gene CPM value of all samples', caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")+
ylab('Density')p1 <- ggplot(df_cpm, aes(y=log10(CPM), x=Sample, fill=cond )) +
geom_boxplot() +
scale_fill_brewer( type='div', palette = "Set1") +
theme(axis.text=element_text(size=9, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(title="Distribution of the gene CPM value of all samples")
df_cpm2 <- merge(x= df_cpm, y= gene_details, by = "GENEID", all.x = TRUE)
p2 <- ggplot(df_cpm2[!is.na(df_cpm2$TXBIOTYPE), ], aes(y=log10(CPM), x=TXBIOTYPE, fill=Sample)) +
geom_boxplot() +
scale_fill_brewer( type='div', palette = "RdYlBu") +
theme(plot.title = element_text(hjust = 0.5, face='bold'),
axis.text=element_text(size=11, face='bold'),
axis.title=element_text(size=12,face="bold"),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
xlab(label = "") + labs(title= "Comparison of the CPM values between\nlincRNAs and protein-coding genes ", caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")
(p1 | p2) + plot_annotation(tag_levels = 'A')The PCA plot suggest that the two RNA-seq samples from same individuals were more similar than those under same cryopreserved conditions. The same thing is shown by looking at the pearson correlation matrix between the samples and performing hierarchical clustering, except for the fresh sample from donor 501 that was clustered away from its frozen sample as shown in the figures below.
# For PCA analysis the rlog normalized counts are used.
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
rv <- rowVars(assay(DESeq.rlog))
# Obtain the indecies of the top 500 variable genes
top_variable <- order(rv, decreasing = TRUE)[seq_len(500)]
# Compute the PCAs based on the rlog normalized gene expression
pca <- prcomp(t(assay(DESeq.rlog)[top_variable, ]))
# data.frame contains informtion for each sample. it will be used for PCA plot
data <- data.frame(colData(DESeq.ds))
# rename columns
colnames(data) <- c("factor","donor", "ID", "Condition" )
# Plot the two top PCs
p1 <- autoplot(pca, data=data, colour = 'Condition', label = TRUE, label.size = 4, shape = FALSE) +
scale_colour_brewer( type='qual', palette = "Set1") +
theme_classic()+
theme(plot.title = element_text(hjust = 0.5, face='bold'),
legend.position="bottom") +
labs(title="PCA plot for Cryopreservation Imapct Samples", caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")
p1sample_df <- as.data.frame(colData(DESeq.ds)[c('ID','cond')])
colnames(sample_df) <- c('Donor ID', 'Condition')
ann_colors = list(Condition = c(Frozen = "#8DA0CB", Unfrozen = "#FC8D62"))
corr_coeff <- cor(assay(DESeq.rlog), method = "pearson")
p1 <- as.dist(1-corr_coeff, upper = TRUE) %>%
as.matrix %>%
pheatmap::pheatmap(., main = "Pearson Distance", annotation_row = sample_df, annotation_colors = ann_colors)# Run pvclust
d.pv <- as.dendrogram(pvclust(corr_coeff, nboot=10, quiet=T)$hclust)
ddata <- dendro_data(d.pv, type = "rectangle")
# Get data frames to plot
df_seg <- segment(ddata)
df_labs <- data.frame(label(ddata), ID = as.factor(sample_df[match(label(ddata)$label, rownames(sample_df)), "Donor ID"]))
# Create ggplot dendrogram
p <- ggplot()+
geom_segment(data = df_seg,
aes(x = x, y = y, xend = xend, yend = yend),
size = 1.25,
colour = "darkgray", lineend = "round") +
geom_text(data = df_labs, aes(x = x, y = y, label = label, colour = ID),
nudge_y = 0,
family = "serif", size = 4, angle = 90, hjust = 1, fontface='bold') +
scale_colour_brewer( type='div', palette = "Dark2") +
xlab("") + ylab("Height")
p <- p + theme(axis.line.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
text = element_text(family = "serif")) +
# theme_gray()+
scale_y_continuous(expand = expand_scale(add = c(0.25, 0.5)))+
theme_classic() +
theme(axis.line.x = element_blank(), axis.ticks.x =element_blank(),
axis.text.x= element_blank(),legend.position = "none",
plot.title = element_text(hjust = 0.5, face='bold')) +
labs(title = "Cluster Dendrogram")
pLastly, we looked at the differential gene expression between the fresh and frozen group. We found 71 differentially expressed genes between groups. This suggest that sample cryopreservation has introduced some biases in the samples. This requires a look at the genes that were differentially expressed and see if they are random or they share a similar characteristics. A summary of the DE analysis is shown in the table below.
colData(DESeq.ds)$cond <- factor(as.character(colData(DESeq.ds)$cond), levels=c("Unfrozen", "Frozen"))
res <- DESeq(DESeq.ds)
res_df <- as.data.frame(results(res))
res_df$SYMBOL <- gene_symbol[match(rownames(res_df), names(gene_symbol))]
res_df <- res_df[(res_df$padj < 0.05) & (!is.na(res_df$padj)), ]
res_df <- res_df[order(res_df$padj),]
kable(res_df) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | SYMBOL | |
|---|---|---|---|---|---|---|---|
| ENSG00000247982 | 122.9535 | 2.0932760 | 0.3401249 | 6.154433 | 0.0000000 | 0.0000062 | LINC00926 |
| ENSG00000210140 | 577.2135 | 1.5977724 | 0.2633113 | 6.067999 | 0.0000000 | 0.0000062 | MT-TC |
| ENSG00000125740 | 496.4931 | 3.6827282 | 0.6200794 | 5.939123 | 0.0000000 | 0.0000092 | FOSB |
| ENSG00000210144 | 274.2191 | 1.7720618 | 0.3112996 | 5.692464 | 0.0000000 | 0.0000302 | MT-TY |
| ENSG00000090104 | 263.8083 | 2.2489824 | 0.4162041 | 5.403556 | 0.0000001 | 0.0001261 | RGS1 |
| ENSG00000198695 | 20494.6069 | 1.5550274 | 0.2982592 | 5.213678 | 0.0000002 | 0.0002977 | MT-ND6 |
| ENSG00000071537 | 1523.1205 | -0.3828004 | 0.0754102 | -5.076243 | 0.0000004 | 0.0005306 | SEL1L |
| ENSG00000163534 | 139.1476 | 2.5208469 | 0.5048343 | 4.993415 | 0.0000006 | 0.0006359 | FCRL1 |
| ENSG00000101347 | 9278.7095 | -0.3826888 | 0.0766123 | -4.995137 | 0.0000006 | 0.0006359 | SAMHD1 |
| ENSG00000248527 | 3981.3000 | 1.6383975 | 0.3317795 | 4.938212 | 0.0000008 | 0.0007607 | MTATP6P1 |
| ENSG00000177606 | 1461.6758 | 3.4222802 | 0.7126031 | 4.802505 | 0.0000016 | 0.0013743 | JUN |
| ENSG00000140526 | 1877.2060 | -0.4907780 | 0.1026225 | -4.782361 | 0.0000017 | 0.0013929 | ABHD2 |
| ENSG00000225630 | 2106.6768 | 1.5948404 | 0.3399511 | 4.691382 | 0.0000027 | 0.0016878 | MTND2P28 |
| ENSG00000237973 | 267.2047 | 1.7148056 | 0.3660168 | 4.685046 | 0.0000028 | 0.0016878 | MTCO1P12 |
| ENSG00000136485 | 1902.1383 | -0.5928479 | 0.1254639 | -4.725246 | 0.0000023 | 0.0016878 | DCAF7 |
| ENSG00000198840 | 15925.4121 | 1.4773019 | 0.3143308 | 4.699831 | 0.0000026 | 0.0016878 | MT-ND3 |
| ENSG00000007312 | 310.0087 | 0.6064963 | 0.1303061 | 4.654398 | 0.0000032 | 0.0018035 | CD79B |
| ENSG00000137409 | 2758.0218 | 0.3615614 | 0.0778020 | 4.647201 | 0.0000034 | 0.0018035 | MTCH1 |
| ENSG00000156738 | 343.4310 | 3.1099976 | 0.6724906 | 4.624596 | 0.0000038 | 0.0019059 | MS4A1 |
| ENSG00000177485 | 815.4753 | -0.4990550 | 0.1100175 | -4.536143 | 0.0000057 | 0.0027638 | ZBTB33 |
| ENSG00000228253 | 11130.3689 | 1.6020981 | 0.3542848 | 4.522062 | 0.0000061 | 0.0028135 | MT-ATP8 |
| ENSG00000258441 | 946.2335 | 0.5066317 | 0.1143183 | 4.431766 | 0.0000093 | 0.0040988 | LINC00641 |
| ENSG00000231721 | 548.1144 | 0.6608080 | 0.1531602 | 4.314489 | 0.0000160 | 0.0067105 | LINC-PINT |
| ENSG00000141367 | 2691.6656 | -0.4902580 | 0.1141841 | -4.293577 | 0.0000176 | 0.0070679 | CLTC |
| ENSG00000102901 | 1273.9638 | 0.3100813 | 0.0727763 | 4.260743 | 0.0000204 | 0.0078631 | CENPT |
| ENSG00000155287 | 764.7964 | 0.3912217 | 0.0930692 | 4.203557 | 0.0000263 | 0.0090537 | SLC25A28 |
| ENSG00000089327 | 4073.0997 | 0.3750953 | 0.0891546 | 4.207246 | 0.0000259 | 0.0090537 | FXYD5 |
| ENSG00000005893 | 1168.7492 | -0.4178340 | 0.0993964 | -4.203715 | 0.0000263 | 0.0090537 | LAMP2 |
| ENSG00000178188 | 2098.8204 | 0.2853805 | 0.0711636 | 4.010200 | 0.0000607 | 0.0189172 | SH2B1 |
| ENSG00000158552 | 693.5175 | 0.3726694 | 0.0928264 | 4.014693 | 0.0000595 | 0.0189172 | ZFAND2B |
| ENSG00000210194 | 150.7174 | 1.3518971 | 0.3371524 | 4.009750 | 0.0000608 | 0.0189172 | MT-TE |
| ENSG00000253729 | 4955.7743 | -0.4088103 | 0.1021872 | -4.000601 | 0.0000632 | 0.0190494 | PRKDC |
| ENSG00000153201 | 3482.5508 | -0.4618723 | 0.1157512 | -3.990217 | 0.0000660 | 0.0192997 | RANBP2 |
| ENSG00000185009 | 856.5723 | -0.4275459 | 0.1085783 | -3.937673 | 0.0000823 | 0.0223709 | AP3M1 |
| ENSG00000172890 | 2670.8198 | 0.2737857 | 0.0695198 | 3.938241 | 0.0000821 | 0.0223709 | NADSYN1 |
| ENSG00000171302 | 611.6765 | -0.4030755 | 0.1025288 | -3.931340 | 0.0000845 | 0.0223709 | CANT1 |
| ENSG00000153827 | 4277.3621 | -0.2958935 | 0.0753367 | -3.927615 | 0.0000858 | 0.0223709 | TRIP12 |
| ENSG00000057252 | 983.4377 | -0.2986947 | 0.0761829 | -3.920757 | 0.0000883 | 0.0224116 | SOAT1 |
| ENSG00000131876 | 1055.2962 | 0.3496586 | 0.0894054 | 3.910935 | 0.0000919 | 0.0227445 | SNRPA1 |
| ENSG00000115307 | 1756.8106 | 0.3464496 | 0.0891941 | 3.884220 | 0.0001027 | 0.0243501 | AUP1 |
| ENSG00000105887 | 2191.7146 | -0.3742273 | 0.0963934 | -3.882289 | 0.0001035 | 0.0243501 | MTPN |
| ENSG00000175216 | 1204.2976 | -0.4545594 | 0.1174169 | -3.871327 | 0.0001082 | 0.0246772 | CKAP5 |
| ENSG00000135164 | 3461.8329 | 0.3078830 | 0.0796090 | 3.867441 | 0.0001100 | 0.0246772 | DMTF1 |
| ENSG00000165934 | 1086.4042 | -0.4350308 | 0.1133018 | -3.839575 | 0.0001232 | 0.0270248 | CPSF2 |
| ENSG00000215271 | 214.6026 | -0.4803380 | 0.1259479 | -3.813785 | 0.0001369 | 0.0287038 | HOMEZ |
| ENSG00000141027 | 3631.4037 | -0.3838474 | 0.1005618 | -3.817031 | 0.0001351 | 0.0287038 | NCOR1 |
| ENSG00000125107 | 5173.1848 | -0.4073286 | 0.1070721 | -3.804245 | 0.0001422 | 0.0291980 | CNOT1 |
| ENSG00000172315 | 384.5109 | -0.5335844 | 0.1411938 | -3.779091 | 0.0001574 | 0.0316377 | TP53RK |
| ENSG00000197102 | 8720.0053 | -0.3150027 | 0.0836305 | -3.766601 | 0.0001655 | 0.0325836 | DYNC1H1 |
| ENSG00000148341 | 1528.0802 | 0.2600918 | 0.0693389 | 3.751022 | 0.0001761 | 0.0339831 | SH3GLB2 |
| ENSG00000085719 | 1109.1622 | -0.3666108 | 0.0978709 | -3.745863 | 0.0001798 | 0.0340092 | CPNE3 |
| ENSG00000104957 | 898.1565 | 0.3283645 | 0.0879123 | 3.735139 | 0.0001876 | 0.0348092 | CCDC130 |
| ENSG00000111674 | 1424.9214 | 0.3238274 | 0.0879695 | 3.681134 | 0.0002322 | 0.0407319 | ENO2 |
| ENSG00000114978 | 2929.4948 | -0.3116295 | 0.0845244 | -3.686857 | 0.0002270 | 0.0407319 | MOB1A |
| ENSG00000144426 | 350.9537 | -0.4413756 | 0.1198982 | -3.681251 | 0.0002321 | 0.0407319 | NBEAL1 |
| ENSG00000279088 | 306.0024 | 0.4482117 | 0.1228121 | 3.649572 | 0.0002627 | 0.0423861 | RP11-574K11.24 |
| ENSG00000185591 | 2680.0752 | -0.3623868 | 0.0988498 | -3.666033 | 0.0002463 | 0.0423861 | SP1 |
| ENSG00000100596 | 1231.8482 | -0.5216516 | 0.1426137 | -3.657795 | 0.0002544 | 0.0423861 | SPTLC2 |
| ENSG00000140983 | 2464.9095 | 0.3535820 | 0.0967378 | 3.655054 | 0.0002571 | 0.0423861 | RHOT2 |
| ENSG00000108582 | 981.6010 | -0.3383162 | 0.0928311 | -3.644427 | 0.0002680 | 0.0423861 | CPD |
| ENSG00000087074 | 854.3845 | 0.9810162 | 0.2691466 | 3.644914 | 0.0002675 | 0.0423861 | PPP1R15A |
| ENSG00000114331 | 2209.8550 | -0.2915671 | 0.0801396 | -3.638240 | 0.0002745 | 0.0427169 | ACAP2 |
| ENSG00000198712 | 60915.3453 | 1.7008518 | 0.4681201 | 3.633367 | 0.0002797 | 0.0428413 | MT-CO2 |
| ENSG00000137145 | 1650.1165 | -0.3922689 | 0.1083559 | -3.620191 | 0.0002944 | 0.0436960 | DENND4C |
| ENSG00000198763 | 43155.6931 | 1.6904637 | 0.4667793 | 3.621548 | 0.0002928 | 0.0436960 | MT-ND2 |
| ENSG00000171163 | 779.5569 | 0.3860699 | 0.1073026 | 3.597955 | 0.0003207 | 0.0468847 | ZNF692 |
| ENSG00000196405 | 13775.8606 | 0.2408411 | 0.0670289 | 3.593095 | 0.0003268 | 0.0470553 | EVL |
| ENSG00000080824 | 7234.8730 | -0.3266137 | 0.0910561 | -3.586951 | 0.0003346 | 0.0474691 | HSP90AA1 |
| ENSG00000134453 | 2216.7016 | 0.2388790 | 0.0668905 | 3.571193 | 0.0003554 | 0.0489786 | RBM17 |
| ENSG00000119682 | 1192.2457 | -0.2985510 | 0.0835629 | -3.572772 | 0.0003532 | 0.0489786 | AREL1 |
| ENSG00000108465 | 3239.9714 | 0.3933699 | 0.1103520 | 3.564684 | 0.0003643 | 0.0495030 | CDK5RAP3 |